In the rapidly evolving world of finance and technology, investors are constantly seeking ways to make smarter decisions by combining traditional financial analysis with emerging technological insights. While stock market trends provide a numerical perspective on growth, an organization’s initiatives in cutting-edge fields like Artificial Intelligence (AI) reveal its future readiness and innovation potential. However, analyzing both dimensions - quantitative financial performance and qualitative AI initiatives - requires sifting through multiple, diverse data sources: stock data from platforms like Yahoo Finance, reports in PDFs, and contextual reasoning using Large Language Models (LLMs).
This is where DualLens Analytics comes in. By applying a dual-lens approach, the project leverages Retrieval-Augmented Generation (RAG) to merge financial growth data with strategic insights from organizational reports. Stock data provides evidence of stability and momentum, while AI initiative documents reveal forward-looking innovation. Together, they form a richer, more holistic picture of organizational potential.
With DualLens Analytics, investors no longer need to choose between numbers and narratives—they gain a unified, AI-driven perspective that ranks organizations by both financial strength and innovation readiness, enabling smarter, future-focused investment strategies.
Traditional investment analysis often focuses on financial metrics alone (e.g., stock growth, revenue, market cap), missing the qualitative dimension of how prepared a company is for the future. On the other hand, qualitative documents like strategy PDFs contain valuable insights about innovation and AI initiatives, but they are difficult to structure, query, and integrate with numeric financial data.
This leads to three core challenges:
Fragmented Data Sources: Financial data (stock prices) and strategic insights (PDFs) exist in silos.
Limited Analytical Scope: Manual analysis of growth trends and PDF reports is time-consuming and error-prone.
Decisional Blind Spots: Without integrating both quantitative (growth trends) and qualitative (AI initiatives) signals, investors may miss out on high-potential organizations.
To address this challenge, we set out to build a Retrieval-Augmented Generation (RAG) powered system that blends financial trends with AI-related strategic insights, helping investors rank organizations based on growth trajectory and innovation capacity.
Look for the `--- --- ---` markers in the notebook: each one is a placeholder where you should add your code.
# @title Run this cell => Restart the session => Start executing the below cells **(DO NOT EXECUTE THIS CELL AGAIN)**
!pip install langchain==0.3.25 \
langchain-core==0.3.65 \
langchain-openai==0.3.24 \
chromadb==0.6.3 \
langchain-community==0.3.20 \
pypdf==5.4.0
import yfinance as yf # Used for gathering stock prices
import matplotlib.pyplot as plt # Used for Data Visualization / Plots / Graphs
import pandas as pd # Helpful for working with tabular data like DataFrames
import os # Interacting with the operating system
from langchain.text_splitter import RecursiveCharacterTextSplitter # Helpful in splitting the PDF into smaller chunks
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader # Loading a PDF
from langchain_community.vectorstores import Chroma # Vector DataBase
We select the five organizations below as the analysis pool.
companies = ["GOOGL", "MSFT", "IBM", "NVDA", "AMZN"]
The `config.json` file should contain the `API_KEY` and the API base URL provided by OpenAI; the cell below loads `config.json` and extracts the API details.
- `API_KEY` is a unique secret key that authorizes your requests to OpenAI's API.
- `OPENAI_API_BASE` is the API base URL where the model will process your requests.

**What To Do?** Fill in the `config.json` file provided. The `config.json` should look like this:
{
  "API_KEY": "your_openai_api_key_here",
  "OPENAI_API_BASE": "https://your_openai_api_base/v1"
}
#Loading the `config.json` file
import json
import os
# Load the JSON file and extract values
file_name = "config.json"
with open(file_name, 'r') as file:
    config = json.load(file)
os.environ['OPENAI_API_KEY'] = config["API_KEY"] # Loading the API Key
os.environ["OPENAI_BASE_URL"] = config["OPENAI_API_BASE"] # Loading the API Base Url
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o-mini",        # "gpt-4o-mini" to be used as the LLM
    temperature=0,              # Set the temperature to 0 for deterministic output
    max_tokens=5000,            # Set max_tokens=5000 so that long responses are not clipped off
    top_p=0.95,
    frequency_penalty=1.2,
    stop_sequences=['INST']
)
Gather stock data for the selected organizations from the past three years using the yfinance library, and visualize this data for enhanced analysis.
**Your Task**
- Loop through each company to retrieve stock data of the last three years using the YFinance library.
- Plot the closing prices for each company.
plt.figure(figsize=(14,7))
# Loop through each company and plot closing prices
for symbol in companies:
    ticker = yf.Ticker(symbol)
    data = ticker.history(period="3y")  # Last three years of daily prices
    # Plot closing price
    plt.plot(data.index, data['Close'], label=symbol)
plt.title("Stock Price Trends (Last 3 Years)")
plt.xlabel("Date")
plt.ylabel("Price (USD)")
plt.legend()
plt.grid(True)
plt.savefig("Stock_Price_Trends_3Y.png")
plt.show()
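Beyond eyeballing the plot, a simple three-year growth figure can be computed from the first and last closing prices. This is an illustrative helper (the `pct_growth` name and the sample prices are assumptions, not part of the notebook):

```python
def pct_growth(first_close, last_close):
    """Percentage growth between two closing prices."""
    return (last_close - first_close) / first_close * 100

# With real data this would be:
# pct_growth(data['Close'].iloc[0], data['Close'].iloc[-1])
print(round(pct_growth(100.0, 180.0), 1))  # 80.0
```

A positive value indicates momentum over the window; comparing this figure across the analysis pool gives a first quantitative ranking signal.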
**Your Task**
Tip: Check `ticker.info` for the available financial metrics.
import pandas as pd
import matplotlib.pyplot as plt
companies = ["GOOGL", "MSFT", "IBM", "NVDA", "AMZN", "META"]
metrics_list = {}
# Fetching the financial metrics
for symbol in companies:  # Loop through all the companies
    ticker = yf.Ticker(symbol)
    info = ticker.info
    metrics_list[symbol] = {  # Define the dictionary of all the financial metrics
        "Market Cap": info.get("marketCap", 0),
        "P/E Ratio": info.get("trailingPE", 0),
        "P/E Growth Ratio": info.get("trailingPegRatio", 0),
        "Total Revenue": info.get("totalRevenue", 0),
        "Return on Equity (ROE)": info.get("returnOnEquity", 0),
        "Free Cash Flow": info.get("freeCashflow", 0),
        "Price-to-Book (P/B) Ratio": info.get("priceToBook", 0),
        "Debt-to-Equity Ratio": info.get("debtToEquity", 0),
        "Dividend Yield": info.get("dividendYield", 0),
        "Beta": info.get("beta", 0)
    }
# Convert to DataFrame
df = pd.DataFrame(metrics_list).T
# Converting large numbers to billions for readability by dividing the whole column by 1e9
df["Market Cap"] = df["Market Cap"] / 1e9
df["Total Revenue"] = df["Total Revenue"] / 1e9
df["Free Cash Flow"] = df["Free Cash Flow"] / 1e9
df["Dividend Yield"] = df["Dividend Yield"] * 100 # Convert to percentage
df # Printing the df
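For reporting, raw dollar amounts can also be made human-readable with a small formatter. A minimal sketch (the `human_readable` helper is illustrative, not part of yfinance or pandas):

```python
def human_readable(value):
    """Format a raw dollar amount with a T/B/M suffix."""
    if value >= 1e12:
        return f"{value / 1e12:.2f}T"
    if value >= 1e9:
        return f"{value / 1e9:.2f}B"
    if value >= 1e6:
        return f"{value / 1e6:.2f}M"
    return f"{value:.0f}"

print(human_readable(3_200_000_000_000))  # 3.20T
print(human_readable(74_500_000_000))     # 74.50B
```

This avoids hard-coding a single divisor, which matters when market caps (trillions) and free cash flow (billions) sit in the same table.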
import matplotlib.pyplot as plt
import math
metrics_to_plot = df.columns.tolist()
colors = [
"tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple",
"tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"
]
n_cols = 3
n_rows = math.ceil(len(metrics_to_plot) / n_cols)
fig, axes = plt.subplots(
    n_rows,
    n_cols,
    figsize=(18, 4 * n_rows)  # Scale figure height with the number of rows
)
axes = axes.flatten()
for i, metric in enumerate(metrics_to_plot):
    ax = axes[i]
    ax.bar(df.index, df[metric], color=colors[i % len(colors)])
    ax.set_title(f"{metric} Comparison")
    ax.set_ylabel(metric)
    ax.set_xlabel("Company")
    ax.grid(axis='y')
# Remove unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout(pad=2.0)
plt.show()
Performing the RAG-Driven Analysis on the AI Initiatives of the companies
**Your Task**
# Unzipping the AI Initiatives Documents
import zipfile
with zipfile.ZipFile("/content/pdf_data/Companies-AI-Initiatives.zip", 'r') as zip_ref:
    zip_ref.extractall("/content/pdf_data")  # Storing all the unzipped contents in this location
# Path of all AI Initiative Documents
ai_initiative_pdf_paths = [f"/content/pdf_data/Companies-AI-Initiatives/{file}" for file in os.listdir("/content/pdf_data/Companies-AI-Initiatives")]
ai_initiative_pdf_paths
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader(path = "/content/pdf_data/Companies-AI-Initiatives/") # Creating a PDF loader object
# Defining the text splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=1000,
    chunk_overlap=200
)
# Splitting the chunks using the text splitter
ai_initiative_chunks = loader.load_and_split(text_splitter)
# Total length of all the chunks
len(ai_initiative_chunks)
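To build intuition for what the splitter above does, here is a toy fixed-size chunker with overlap, implemented from scratch. It is a simplification: this sketch counts characters and ignores separators, whereas `RecursiveCharacterTextSplitter.from_tiktoken_encoder` counts tokens and tries to split on natural boundaries.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into overlapping fixed-size character chunks."""
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

print(chunk_text("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk — the reason `chunk_overlap=200` is set above.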
# Defining the 'text-embedding-ada-002' as the embedding model
from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
# Creating a Vectorstore, storing all the above created chunks using an embedding model
vectorstore = Chroma.from_documents(
    ai_initiative_chunks,
    embedding_model,
    collection_name="AI_Initiatives"
)
# Ignore if it gives an error or warning
You can safely ignore this error. It is a known, harmless telemetry issue in Chroma and does not affect your vector store, embeddings, or retrieval:
- Chroma tries to send anonymous usage telemetry.
- A version mismatch between Chroma and its telemetry dependency (posthog) makes the telemetry call fail.
- Your vectorstore is still created successfully: the data is stored, embeddings work, and similarity search works.
# Creating a retriever object which fetches the ten most similar chunks from the vectorstore
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 10}
)
user_message = "Give me the best project that `IBM` is working on"
# Building the context for the query using the retrieved chunks
relevant_document_chunks = retriever.get_relevant_documents(user_message)
context_list = [d.page_content for d in relevant_document_chunks]
context_for_query = ". ".join(context_list)
len(relevant_document_chunks)
# Write a system message for an LLM to help craft a response from the provided context
qna_system_message = """
You are an assistant whose job is to review the provided documents and give appropriate answers from the context.
User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.
User questions will begin with the token: ###Question.
Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.
If the answer is not found in the context, respond "I don't know".
"""
# Write a user message template which can be used to attach the context and the questions
qna_user_message_template = """
###Context
Here are some documents that are relevant to the question mentioned below.
{context}
###Question
{question}
"""
# Format the prompt
formatted_prompt = f"""[INST]{qna_system_message}\n
{'user'}: {qna_user_message_template.format(context=context_for_query, question=user_message)}
[/INST]"""
# Make the LLM call
resp = llm.invoke(formatted_prompt)
resp.content
# Define RAG function
def RAG(user_message):
    """
    Args:
        user_message: A user input for which the response should be retrieved from the vector DB.
    Returns:
        The LLM answer grounded in the context retrieved for the user query.
    """
    relevant_document_chunks = retriever.get_relevant_documents(user_message)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)
    # Combine qna_system_message and qna_user_message_template to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
{'user'}: {qna_user_message_template.format(context=context_for_query, question=user_message)}
[/INST]"""
    # Querying the LLM
    try:
        response = llm.invoke(prompt)
    except Exception as e:
        # Return the error message directly; the error string has no .content attribute
        return f'Sorry, I encountered the following error: \n {e}'
    return response.content
# Test Cases
print(RAG("How is the area in which GOOGL is working different from the area in which MSFT is working?"))
print(RAG("What are the three projects on which MSFT is working upon?"))
print(RAG("What is the timeline of each project in NVDA?"))
print(RAG("What are the areas in which AMZN is investing when it comes to AI?"))
print(RAG("What are the risks associated with projects within GOOG?"))
# Writing a question for performing evaluations on the RAG
evaluation_test_question = "What are the three projects on which MSFT is working upon?"
# Building the context for the evaluation test question using the retrieved chunks
relevant_document_chunks = retriever.get_relevant_documents(evaluation_test_question)
context_list = [d.page_content for d in relevant_document_chunks]
context_for_query = ". ".join(context_list)
# Default RAG Answer
answer = RAG(evaluation_test_question)
print(answer)
# Defining the user message template for evaluation
evaluation_user_message_template = """
###Question
{question}
###Context
{context}
###Answer
{answer}
"""
# Writing the system message and the evaluation metrics for checking the groundedness
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.
Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely
Metric:
The answer should be derived only from the information presented in the context
Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation of whether the answer adheres to the metric, considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""
# Combining groundedness_rater_system_message + llm_prompt + answer for evaluation
groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
{'user'}: {evaluation_user_message_template.format(context=context_for_query, question=evaluation_test_question, answer=answer)}
[/INST]"""
# Defining a new LLM object
groundness_checker = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=1.2,
    stop_sequences=['INST']
)
# Using the LLM-as-Judge for evaluating Groundedness
groundness_response = groundness_checker.invoke(groundedness_prompt)
print(groundness_response.content)
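The judge returns a free-text verdict, so the numeric score still has to be parsed out before it can be aggregated across test questions. A minimal regex sketch (the `extract_score` helper, and the assumption that the judge ends with a standalone digit 1-5, are ours, not part of LangChain):

```python
import re

def extract_score(judge_text):
    """Return the last standalone 1-5 rating found in an LLM judge response."""
    matches = re.findall(r"\b([1-5])\b", judge_text)
    return int(matches[-1]) if matches else None

print(extract_score("The answer follows the metric completely. Score: 5"))  # 5
print(extract_score("No rating given."))  # None
```

Taking the last match follows the instruction order above, where the score is assigned in the final step; a more robust approach would ask the judge for structured (e.g. JSON) output.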
# Writing the system message and the evaluation metrics for checking the relevance
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.
Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely
Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.
Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation of whether the answer adheres to the metric, considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""
# Combining relevance_rater_system_message + llm_prompt + answer for evaluation
relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
{'user'}: {evaluation_user_message_template.format(context=context_for_query, question=evaluation_test_question, answer=answer)}
[/INST]"""
# Defining a new LLM object
relevance_checker = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=1.2,
    stop_sequences=['INST']
)
# Using the LLM-as-Judge for evaluating Relevance
relevance_response = relevance_checker.invoke(relevance_prompt)
print(relevance_response.content)
Prompting an LLM to score each company by integrating Quantitative data (stock trend, growth metrics) and Qualitative evidence (PDF insights)
**Your Task**
- Write a system message and a user message that outlines the required data for the prompt.
- Prompt the LLM to rank and recommend companies for investment based on the provided PDF and stock data to achieve better returns.
# Counting all the documents stored in the vectorstore
len(vectorstore.get()['documents'])
# Write a system message for instructing the LLM for scoring and ranking the companies
system_message = """
You are a financial analyst assistant. Your task is to evaluate and rank a list of companies for investment potential.
You will be provided with two types of data:
1. Quantitative financial growth metrics such as market capitalization, P/E ratio, P/E Growth Ratio, Total Revenue, Return on Equity (ROE), Free Cash Flow, Price-to-Book (P/B) Ratio, Debt-to-Equity Ratio, Dividend Yield, and beta.
2. Qualitative strategic insights extracted from organizational reports (AI initiatives and other strategic information).
Your goal is to analyze both the quantitative data and qualitative insights to score each company on overall growth potential, innovation, risk, and strategic positioning.
You need to rank the companies from most to least recommended for investment, providing a brief explanation of your reasoning behind the top 3 picks.
Be clear, concise, and justify your rankings based on both data types.
"""
# Write a user message for instructing the LLM for scoring and ranking the companies
user_message = f"""
You are given:
---
### 1. Financial Data (Quantitative)
{df.to_string()}
---
### 2. Strategic Insights (Qualitative)
{vectorstore.get()['documents']}
---
Please score and rank these companies from best to worst investment opportunities. Consider financial growth metrics *and* the qualitative strategic insights.
Provide:
- A ranked list of companies
- Scores or ratings for each company
- Key reasons supporting the rankings, emphasizing strengths and risks
Your evaluation should help an investor decide where to allocate capital for better returns.
Keep the response clear, concise, and formatted in Markdown so that it renders nicely in Google Colab.
"""
# Formatting the prompt
formatted_prompt = f"""[INST]{system_message}\n
{'user'}: {user_message}
[/INST]"""
# Calling the LLM
recommendation = llm.invoke(formatted_prompt)
recommendation.content
# print(recommendation.content)
from IPython.display import display, Markdown
display(Markdown(recommendation.content))
Based on the project, learners are expected to share their observations, key learnings, and insights related to the business use case, including any challenges they encountered. Additionally, they should recommend improvements to the project and suggest further steps for enhancement.
A. Summary / Your Observations about this Project - 2 Marks
The project effectively combines financial growth metrics with strategic insights from organizational reports using a RAG-based DualLens approach.
Investment rankings align with market leaders (Microsoft, NVIDIA, Google), indicating accurate retrieval and meaningful synthesis by the LLM.
The approach improves explainability, as recommendations are supported by both quantitative data and qualitative strategy analysis.
B. Recommendations for this Project / What improvements can be made to this Project - 2 Marks
Introduce a structured, weighted scoring framework to improve consistency and reduce subjectivity in LLM-generated scores.
Add time-aware retrieval and risk/confidence indicators to avoid outdated strategic insights and improve reliability.
Validate the system through backtesting and portfolio-level analysis to measure real-world investment performance.
The project can be improved by adding more data points from the yfinance library for fundamental analysis.
Two separate recommendations could be produced for short-term and long-term investments.
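The structured, weighted scoring framework recommended above could start as simple as the following sketch, where the weight values and the normalized per-company metric scores are illustrative assumptions, not outputs of the notebook:

```python
def weighted_score(metrics, weights):
    """Combine normalized metric values (each in [0, 1]) into a single score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(metrics[name] * w for name, w in weights.items())

# Illustrative weights and normalized scores for one company
weights = {"growth": 0.4, "innovation": 0.3, "risk": 0.3}
example_company = {"growth": 0.8, "innovation": 0.9, "risk": 0.5}
print(round(weighted_score(example_company, weights), 2))  # 0.74
```

Fixing the weights up front makes the ranking reproducible across runs, rather than leaving the trade-off between financial strength and innovation readiness implicit in the LLM's reasoning.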